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CHAPTER 1 


INTRODUCTION 

Massively Parallel Processing (MPP) computers offer the prospect of significant 
performance gains over conventional supercomputers, which are now approaching 
hard physical speed limits inherent in their technology. However, the need for faster 
computations continues to grow. As a consequence, massively parallel computers 
are being developed as a possible solution. The Connection Machine-5 (CM-5) is 
one such MPP computer, which may provide better performance than conventional 
supercomputers and may replace the fastest conventional supercomputer in the 
near future. In fact, the CM-5 computer, with maximum 16k nodes installed, is a 
2 TFLOPS computer in theory. 

Computational fluid dynamics (CFD) is one of the areas which requires super- 
fast computational power because the problems typically involve a large number of 
calculations. The massively parallel computers have the potential to become the 
main computational tool for CFD; it may replace the conventional supercomputers 
in the near future. Hu and Jackson [4], and Chriske and Boguez [2], have made the 
performance study for a two dimensional source panel method calculation, and the 
study shows that the parallel code achieved a high performance. The potential of 
MPP computers is being realized. Computational Fluid Dynamics (CFD) is being 
enhanced immensely by the advent of MPP computers. Various CFD problems are 
being solved by MPP computers in a more efficient manner. Thus, the study of the 
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performance of MPP computers on CFD problems has become a very important 
issue. 

In the realm of Computational Fluid Dynamics, the finite difference method 
(FDM) and the finite volume method (FVM) are major methods of solving incom- 
pressible and compressible flows, including transonic flow problems. The FDM and 
FVM for the Navier-Stokes equations, though successful, are expensive to imple- 
ment. However, the Integral Equation Method(IEM) applied to the potential equa- 
tion formulation for incompressible and some compressible flows, including transonic 
flows, represents an efficient alternative to FDM and FVM. 

The IEM, based on Green’s theorem, represents the solution in terms of in- 
tegrals over the computational domain and boundary of the domain. The IEM 
computation for typical aerodynamics problems involves evaluating a large number 
of integrals, which is expensive to evaluate on serial-architecture supercomputers. 
In contrast, these integral computations may be done less expensively in a massively 
parallel computing environment. Therefore, there is a need to study its performance. 

The IEM for linearized subsonic and supersonic flow computations has been in 
use since the 1960’s and has become an indispensable tool in aerodynamic analysis 
and design. A review of the method is given by Kandil and Yates [6]. Since the 
1970 s, several numerical schemes implementing the IEM for nonlinear compressible 
including transonic flows have been developed by some investigators [5-6]. 

For example, Hu [5] developed a numerical scheme using the IEM in solving 
the full- potential equation for three dimensional compressible including transonic 
flows. The numerical scheme is implemented on the conventional supercomputer 
Cray-YMP. 

The objective of this work is to implement, including the necessary modica- 
tions, the numerical scheme of the IEM developed by Hu [5] in a massively parallel 
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computing environment. A comparative study on the performance of the parallel 
code and the serial code for three dimensional calculations is made. 

Chapter 2 presents the physical and mathematical foundation of the problem, 
along with the integral equation solution and its associated numerical algorithm. 
Chapter 3 discusses the CM-5 architecture, parallel CM-FORTRAN computer lan- 
guage, and implementation of parallel computation of the IEM. The performance 
study is made in Chapter 4. Finally, the concluding remarks on this investigation 
are given in Chapter 5. 
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CHAPTER 2 


MATHEMATICAL FORMULATION 

2.1 The Physical Problem 

The physical problem being studied is the flow around three dimensional aero- 
dynamic configurations. Incompressible and compressible flows are considered. In 
order to study flows, it is necessary to study the laws of conservation of mass, mo- 
mentum and energy, which lead to the fundamental governing equations for fluid 
flows. The following derivation of the fundamental governing equations are given 
by Anderson [1], The derivation of the full-potential equation is based on Hu [5] 
but is modified to meet the specifications of the problem solved. 

2.2 Fundamental Equations of Fluid Flows 

There are three fundamental equations for fluid flows, the continuity equation, 
the momentum equation and the energy equation. Each equation has its physical 
interpretation. 

2.2.1 The Continuity Equation 

The physical principle of the continuity equation can be stated that the mass 
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can be neither created nor destroyed. The principle is expressed mathematically as 


pdti + j pV ■ dS = 0 (2.1) 

where the first term is the time rate of increase of mass inside control volume, V; 
the second term represents the flow across the boundary, S, of the control volume, 
V ; V is the flow field velocity vector, p the density and t the time. Applying the 
divergence theorem, Equation (2.1) becomes 

Jji^ + V ■ (pV)}M = 0 (2.2) 

For arbitrary control volume, V , Equation (2.2) implies that 

^ + S7-(pV) = 0 (2.3) 

which is the continuity equation in differential equation form. 

2.2.2 The Momentum Equation 


The physical principle of the momentum equation is that the force equals time 
rate of change of momentum. The equation in integral form is 


d_ 

dt 


f pV(N + 

Jv 


Is 


(, pV -dS)V 



pfbdN + F v 


(2.4) 


where p is the pressure; dS is the normal vector of the surface elements. J v pV <N 

is the time rate of change of momentum contained at any instant inside the control 
volume due to flow fluctuations. The net flow of momentum out of the control 
volume through surface S is the summation of the above elemental contributions 
and is expressed as f s (pV ■ dS)V. — J s pdS is the pressure force acting on the 
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control volume. f v pft,d\/ is the body force and F v is the viscous force. Applying 
the divergence theorem to surface integral terms of Equation (2.4), the momentum 
equation becomes 

+ v '(^) + VP - Pfb - fv](N = o (2.5) 

— * 

where f v represents the viscous force term. Since the control volume V is arbitrarily 
chosen, Equation (2.5) gives 

^ + v • pvv + VP - pfb -fv = 0 (2.6) 

which is the momentum equation in differential equation form. 

2.2.3 The Energy Equation 

The physical principle of the energy equation is that energy can be neither 
created nor destroyed; it can only change form. The energy equation is expressed 
as 

j 'qpdi+Q„- I ' P VdS+ I ’ p(fV)M+W„ = ^ / p(c+y)<N+ j plp + ^lVdS 

(2.7) 

where W v is work due to viscous effects. Q v is due to viscous effects, q is the 
volumetric rate of heat addition per unit mass. Q f is the total rate of heat addition 
to the control volume due to viscous effects. W f is the total work done by viscous 
effects on the control surface S. t is the time. 

The partial differential form of Equation (2.7) is given by 

+ ^) + V ■ We + ^-)V) = pq-V- ( pV ) + p(h ■ V) + Q‘ + W’ (2.8) 
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2.3 Full-Potential Equation 


The Equations (2.3), (2.6) and (2.8) are combined to form the Navier-Stokes 
equations in the computational fluid dynamics, although in the theoretical fluid me- 
chanics, Equation (2.6) alone is called the Navier-Stokes equations. They represent 
the most general governing equations for fluid flows. However, for some flows, there 
are simplified governing equations. Flows in which vorticity is zero are known as 
irrotational flows. Fluid elements of irrotational flow field have no angular velocity. 
The mathematical relation representing irrotational flow is 

V x V = 0 (2.9) 

which means that the curl of velocity vector field V is zero. Comparing Equation 
(2.9) with the vector identity, y x y«l> = 0, it is found that 

V = y$ (210) 

This implies that for irrotational flows, there exists a scalar function d*. the velocity 
potential. Inviscid flows are flows in which viscous effects axe negligible. Com- 
pressible flows as opposed to incompressible flows are flows in which the density p 
varies. 

For unsteady inviscid compressible flows with negligible body forces, the con- 
tinuity and momentum equations, Equations (2.3) and (2.6) reduces to 


dp 

dt 

+ v ■ 

VP + P V -V = 0 

(2.11) 

dV 


- 1 

(2.12) 

dt 

+ v 

• + - VP = 0 

p 
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respectively. 

Using the identity, 


- V 2 

V ■ \/V = X/— - V x (v X V) 


(2.13) 


Equation (2.12) becomes 

dV V 2 1 

^ + V(y)-^x(VxV)+-VP = 0 (2-14) 

For steady, irrotational flows, Jy = 0 and v x U = 0. Equation (2.14) reduces 
to 

V(v)+“VP = 0 ( 2 - 15 ) 

2 p 

Using the velocity potential, <£, with 


V = 


then Equation (2.15) becomes 


= 0 

2 p 


Integrating Equation (2.16) with respect to space yields 

(V *) 2 


+ 


/?- 


For barotropic fluid, there is the relation 


f dp a 2 

J P « - 1 


(2.16) 


(2.17) 


(2.18) 


where a is the speed of sound and « is the gas specific heat ratio. If the fluid is at 
rest at infinity, then it follows that 


9 = 


K - 1 


(2.19) 
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where subscript oo refers to the infinity condition. Substituting Equation (2.18) 
and (2.19) into (2.17) yields 


( v $) 2 , « 2 = 

2 K — 1 K — 1 


( 2 . 20 ) 


Assuming that the flow is isentropic and the isentropic relation, which is an- 
other form of the energy equation, 



Poo 


a 

~T 


a 




( 2 . 21 ) 


is used into Equation (2.20) to get 


— = [i-4r^(v*) 2 ] :riT (2.22) 

poo 2 

By defining the Mach number M 0 0 as ratio of and a , Equation (2.22) takes the 
form 

— = (1 - [^] 2 (Vf) 2 ]* (2.23) 

Poo ^ 1 oo 

For steady, irrotational flows, the continuity equation, Equation (2.11), reduces 
to 


• VP + p V * V $ — o 


(2.24) 


which can be rewritten in the following form 

y 2 $ = -- y p • (2.25) 

P 

After introducing the characteristic parameters of V IXJ , and the wing root chord 
length c, Equations (2.25) and (2.23) take the dimensionless form as follows 


V 2< f> = G 


(2.26) 


with 


G = 


— ~ V P • 

P 


(2.27) 
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and 


*>=[!- 


(2.28) 


Equation (2.26) along with Equations (2.27) and (2.28) is the full-potential equa- 
tion for inviscid compressible flows. For transonic flows, Equation (2.26) along with 
Equations (2.27) and (2.28) is a mixed elliptic-hyperbolic partial differential equa- 
tion. It should be noted that equation (2.26) is written in the poisson’s equation 
form, where the compressibility, G, is recognized as an inhomogeniety instead of a 
nonlinearity. In order to study the mixed nature, one may investigate the Transonic 
Small Disturbance(TSD) equation, which is obtained from the full-potential equa- 
tion under the small disturbance assumption. The TSD equation can be written 
as 


[1 - M 2 (x,y,z )] 


d 2 $ d 2 $ d 2 $ 


+ 


+ 


0 


(2.29) 


8x 2 dy 2 dz 2 

From Equation (2.29) when the local Mach number, M(x,y,z) is larger than 
one, the equation is a hyperbolic partial differential equation. When M(x,y,z) is 
less than one, the equation is a elliptic partial differential equation. 


2.4 Boundary Conditions 


The boundary conditions are surface no-penetration condition, Kutta condi- 
tion, infinity condition, and wake kinematic and dynamic conditions. They are 
described as follows: 

V ■ n g — Q on g(r) = 0 (2.30) 

A C p \ sp = 0 (2.31) 

— > 0 away from q(r) =0 and w(r) = 0 (2.32) 
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V * n w = 0 on w{r) = 0 


(2.33) 


and 


A C p =0 on w(r) = 0 (2.34) 

where n g is the unit normal vector of the wing surface, g(r) = 0; C p is the surface 
pressure coefficient; the subscripts sp refers to the edges of separation; and w(r) = 0 
is wake surface(s). For nonlifting symmetric flow where the wake surface can be 
neglected, the required boundary conditions are surface no-penetration and infinity 
condition given by Equations (2.30) and (2.31), respectively. 

2.5 Integral Equation Solution 


By using the Green’s theorem, the integral equation solution in terms of the 
velocity field for nonlifting flows is given by 


V(x,y,z) = Vo, 


-ill. 

+ ^IIi 

_ J _ f f 

4 tt J Js d 2 


Qgi^hO 

d 2 

G(U 7,0 


e d ds(£,r],C) 

e d d£drjd( 


e d ds((, rj,0 


(2.35) 


where is the free-stream velocity; q is the surface source distribution; the sub- 
script, 5, refers to the shock surface; ds is the infinitesimal surface area; the vector 
d is given by d — (x — £ )i + (y — 7])j + (z — ()k; and is defined by = d/\d\. 
It can be seen that the infinity condition is automatically satisfied by the integral 
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equation solution. 


2.6 Discretization 

To solve the problem using field-panel method, Equation (2.35) must be dis- 
cretized. The discretization in terms of field-panels is as follows 

— boo 


1 

47T 


LG NG 


I I ^e d ds(Z,r], C) 

i=l*=l J J 9i,k a 

LV MV NV 

i=lj=lk=l J J JV Lj,k 

^ MS NS r . 1 

+ / / ^ddsU,rjX) 

j=lk=l J Jb ) k 


(2.36) 


where the indices,?, j and k refer to the surface and field panels; LG x NG is the 
total number of wing surface panels; LV x MV x NV is the total number of field 
panels; and MS x NS is the total number of shock surface panels. The constantly 
distributed surface and volume sources are used. 

2.7 Numerical Scheme 

2.7.1 Panel Method for Incompressible Flows 


The standard panel method is used to solve the incompressible flow problems. 
In Equation (2,36), G = 0 for incompressible flows, and hence q 3 =0. Equation 
(2.36) reduces to 


LG NG 


V{x,y,z) = Voo - — Y^2 q 9i.k / / -pCddsfari, () (2.37) 

i=ut=i J J 9i,k 
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After applying the wing surface zero-normal-velocity boundary condition at each 
control point (CP) of all panels, 


C(x, y, z) ■ n = 0 (2.38) 

a system of equation is obtained, 

MM = {«} (2.39) 

where [A] is N x N aerodynamic influence coefficient matrix, and N = LG x NG; 
{5} is a Nxl unknown vector matrix containing qj for j = 1 to N; and { B } is a N 
x 1 known vector matrix which is contributed from Vqq. The solution procedure 
of the problem using source panel involves three major steps: Step 1 - evaluation 
of integrals for N 2 times to construct matrices [A] and also { B }; Step 2 - solving 
the resulting dense linear system of Equation (2.39); and Step 3 - post-processing 
of aerodynamic calculations. 

Step 1 involves evaluating a large number of integrals. The total number of 
integrals can be very large for aerodynamic problems. It can be in the order of 10 8 
if LG x NG = 100 x 100. An important feature of step 1 is that the calculation 
for each (rr, y, z) and each (£, y, Q can be performed simultaneously for all (x, y, z) 
and all (f,y, C)* This feature of panel method calculation leads itself in a natural 
way for processing data in a parallel computing environment. 

2.7.2 Field Panel Method for Compressible Flows 

The field Panel method is used to solve the problem. The scheme developed by 
Hu [5] in serial FORTRAN code is inherently used in developing the parallel CM 
FORTRAN version. 

The solution procedure is as follows: 
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Step 1 - Standard Panel Method calculations. 

Step 2 - Computation of the Initial Value of the Compressibility: Unlike the 
incompressible case where the density p is constant, and the compressibility, G, is 
zero, G is updated by using an iterative method. A central difference scheme is 
used to calculate the derivative of p. 

Step 3 - Enforcing the Boundary Conditions: The boundary conditions are 
enforced in order to develop the appropriate matrices, where the compressibility, 
obtained in Step 2 and the source distribution obtained in Step 1 are used. The 
solution gives q g distributions. 

Step 4 - Calculation of the Surface Pressure Coefficients: Once G and the 
source distribution q g are obtained, they are used in Equation (2.36) to calculate the 
velocity at each control point. The surface pressure coefficients are then calculated. 
The pressure coefficient is defined by 


C n 


kMIc Poo 


(— -i) 


(2.40) 


where p and p are all dimensional quantities. By introducing the isentropic flow 
relation, 

JL = (JLy (2.41) 

Poo Poo 

and writing as the dimensionless density, p, Equation (2.41) becomes 

0 Poo 

c -^- 1) (2 ' 42) 

Equation (2.42) is used to compute pressure coefficients at each control point of the 
wing surface. 
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Step 5 - Steps 2-4 are repeated until convergence. 


2.8 Significance of Computing In a Parallel Environment 

The IEM computations for typical aerodynamics problems involves evaluating 
a large number of integrals over the computational domain and the boundaries 
of the domain as indicated in Equation (2.36). The computation of this large 
number of integrals on serial-architecture supercomputers is very computationally 
expensive. On the other hand, it is noticed that all these integral computations 
can be computed simultaneously in a massively parallel processing environment. 
Therefore, it is imperative to study the computational performance of the IEM on 
a MPP environment. An overview of the MPP and CM Fortan is given in the 
following chapter. 
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CHAPTER 3 


PARALLEL IMPLEMENTATION 
3.1 Connection Machine - 5 

The connection machine is a MPP computer that is the brainchild of W . Daniel 
Hillis, Marvin L. Minsky and Scott E. Falman, collectively producing the first proto- 
type in 1985 [3]. In MPP computers, many small processors are made to work simul- 
taneously, each accompanied by small memory. Parallel processing allows memory 
capacity and processing capacity to be utilized yielding high efficiency. Each proces- 
sor is much less powerful and less expensive to produce than a typical conventional 
supercomputer processor, but in tandem they can achieve an extremely high perfor- 
mance which conventional supercomputers can not achieve [3]. Since each processor 
is bearing down on the same problem there has to be a means by which they com- 
municate. The user has the option to use the SIMD( Single Instruction Multiple 
Data) mode of data parallel computing or the MIMD(Multiple Instruction Multi- 
ple Data) mode of message passing computing. The present study of performance 
uses the SIMD mode where the communication is solely controlled by the computer 
implicitly as opposed to MIMD mode where the user explicitly controls necessary 
communi cat ion . 

The Connection Machine CM-5 system is a scalable distributed-memory mul- 
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tiprocessor system [8]. The architecture of the CM-5 is optimized for parallel pro- 
cessing of large, complex problems [8]. The system is designed to operate on large 
amounts of data. The major hardware elements of the system include Front-End 
computers to provide developing and execution environments and a parallel pro- 
cessing unit, which consists of multiple nodes, to execute parallel operations [8]. 
The basic computational unit is the compute node. Compute nodes work in par- 
allel. Each node has its own memory. The processing nodes are supervised by 
a control processor; it broadcasts blocks of instructions to the parallel processing 
nodes and then initiates execution. The CM-5 parallel processing nodes are divided 
into groups, known as partitions. Every control processor and parallel processing 
node in the CM-5 is connected to two scalable interprocessor communication net- 
works, designed to give low latency combined with high bandwidth in any possible 
configuration a user may wish to apply to a problem [8]. 

Each compute node contains 1 spare processor, four vector units(vector length 
= 16) with 32 Megabytes(MB) of memory. A theoretical peak performance is 128 
MFLOPS per node for a total of 16 GFLOPS no the 128 node CM-5 [8]. 

3.2 CM-Fortran 

The software consists of CM-FORTRAN which is a high performance FOR- 
TRAN language, the CMSSL (Connection Machine Scientific Software Library), a 
debugging software package, Prism, and a host of others. The CM-FORTRAN lan- 
guage is an implementation of FORTRAN-77 supplemented with array-processing 
extensions from the standard FORTRAN-90 [9]. These array-processing features 
map naturally onto SIMD data parallel architecture of the CM-5 system, since 
the CM-FORTRAN allows array elements to be evaluated simultaneously. The 
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most important difference between CM-FORTRAN and standard FORTRAN is 
the treatment of entire arrays as objects, thus explicit indexing in CM-FORTRAN 
is not always necessary. For example, it is not necessary to write DO-LOOPS or 
other such control constructs to have the operation repeated for each element of an 
array. 

3.3 Parallel Implementation 

A three dimensional, steady, incompressible and compressible, source field 
panel method is converted to a parallel CM FORTRAN program. Before man- 
ual conversion, the CMAX translator is used to partially convert some of the serial 
FORTRAN code into parallel CM-FORTRAN code under the data parallel (SIMD) 
programming mode. Since CMAX is relatively new and not very well developed yet 
a lot of the conversions made by CMAX have to be fine tuned. Hence, most of the 
conversion is done manually. 

Prior to the description of the code conversion, it is necessary to describe some 
properties of CM-FORTRAN. Arrays in CM FORTRAN come in the following two 
forms, Front End(FE) arrays and Connection Machine(CM) arrays. Front-End Ar- 
rays are for standard FORTRAN operations and are stored on the Partition Man- 
ager and are called Front End (FE) arrays. CM arrays are for parallel FORTRAN 
operations and are stored across the local memories of the processing nodes [9]. 
While using FE and CM arrays one should avoid using an array as both an array 
object and a subscripted array. An array used as an array object always resides 
on the parallel processing nodes and hence has a CM home. If a CM array is used 
as subscripted array (FORTRAN), the system moves the CM array to the FE, one 
element at a time, to perform the sequential operations. This transfer of CM arrays 
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to FE arrays degrades the performance [9]. Intrinsic functions and other attributes 
of CM- FORTRAN are listed along with examples in the following sections. 

3.3.1 Parallelization of Incompressible Flow Case 

Step 1 of the incompressible source panel method is coded in subroutine VEL- 
WING.f, which contructs the linear dense matrices [A] and {B}. The serial FOR- 
TRAN subroutine VELWING.f is converted to fully parallelized code. The original 
subroutine VELWING.f is partially listed in Figure la and its parallel counterpart 
VELWING.fcm in Figure lb. 

In subroutine VELWING.f there are two doubly nested DO-LOOPS, nested IF 
statements and a call to subroutine VWS.f, which is called from the inner doubly 
nested DO-LOOPS. Parallel WHERE assignments are implemented to replace the 
serial FORTRAN IF command. Instead of calling subroutine VWS.f, as in the 
serial version, VELWING.fcm replaces the call to VWS.f with a fully parallelized 
version of the VWS.f subroutine. All four DO-LOOPS are efficiently replaced by 
the intrinsic SPREAD functions. The FORALL command is used to manipulate 
the four dimensional arrays developed by implementing the SPREAD function. A 
brief overview of each CM-5 attributes and a few examples are given below. 

3. 3. 1.1 Attributes of the SPREAD Function 

The SPREAD function is a transformational function. It broadcasts copies 
of a source array(or scalar) along a dimension (as in forming a book from copies 
of single page), yielding an array of rank one greater than the source, with values 
replicated from the source [9]. 

Essentially the spread function is used to replace the nested DO-LOOPS by 
transforming all arrays within the two outer loops and the two inner loops into 
four dimensional arrays. For the smaller problem tested where LGxNG = 24x12, 
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DO-LOOP 1 goes from 1 to 24 and DO-LOOP 2 goes from 1 to 12. DO-LOOP 3 
goes from 1 to 24 and DO-LOOP 4 goes from 1 to 6. The arrays formulated in the 
outer loops are copied along the third and fourth dimension. The arrays formulated 
in the inner loops are copied along the first dimension. The arrays are spread in 
this fashion to ensure correspondence. That is to say when arrays formulated in 
the outer loops 1 and 2 are spread to four dimensional arrays, and when arrays 
formulated within the inner loops 3 and 4 which are spread to four dimensional 
arrays, computations involving arrays originating from the outer loops and the inner 
loops will correspond correctly. Now all computations can be done simultaneously in 
four dimensions instead of serially by using time consuming nested DO-LOOPS. To 
see how the SPREAD function is applied, one may look at array xcm4(24, 12, 24, 6) 
in Figure lb. In the original code listed in Figure la, it is calculated serially within 
the DO-LOOPS 1 and 2 as scalar xc. In the parallel version xc(nr, nc ) is renamed 
xcm(nr,nc). Its elements are calculated simultaneously by using the feature of 
CM-FORTRAN that simply lets you calculate the array instanteously. Hence, using 
xcm { : nr, : nc)— ... will compute and store the all value of xcm( 1, 1), ...,xcm(24, 12) 
in array xcm simultaneously. 

Since xc is located within the DO-LOOPS 1 and 2, but not the two inner 
DO- LOOPS 3 and 4 of VELWING.f, its values in the parallel version are initially 
located in the parallel array xcm( 24, 12). Then they are copied along the third 
dimension via the SPREAD function. Thus, xcm3 contains 24 copies of the values 
in xcm , where xcm3 is the three dimensional array xcm3(24, 12, 24). Likewise, 
xcm3 is copied along the fourth dimension. It contains 6 copies of the values of 
xcm3. xcmA is the four dimensional array xcm4(24, 12,24, 6). 

The scalar xf calculated serially within the inner most DO-LOOPS 11 and 
12 of the origin version can be calculated simultaneously in the parallel version. 
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In the parallel version it is initially the array x fm( 24, 6). Then via spreading it 
along the 1st dimension and making 12 copies of xfm along the first dimension, 
it becomes xfm3 which stores the values in x/m3(12, 24, 6). Then 24 copies of 
xfm.3 is spread along the first dimension of xfmA. yielding x/m4(24, 12, 24, 6). Now 
these four dimensional arrays are manipulated as complete objects instead of being 
manipulated element by element. 

3. 3. 1.2 Attributes of The FORALL Statement 

The FORALL statement is an elemental array assignment statement used 
to specify an array asignment in terms of array elements of array sections. The 
FORALL statement effectively describes a collection of assignments to be executed 
elementally [9]. 

The FORALL statement used in VELWING.fcm and also VELWING2.fcm 
predominantly calculates in parallel the arrays and scalars that in VWS.f subroutine. 
It assigns elements to the necessary arrays. For example, 

zl(is,js,ir,ji') where, is = 1 to 24, js = 1 to 12, ir = 1 to 24 and jr = 1 to 6. 

3. 3. 1.3 Attributes of The WHERE Statement 

The WHERE construct qualifies the evaluation of expressions and assignments 
of values in several array assignment statements [9]. It is very similar to the well 
known IF construct of standard FORTRAN. Unlike the IF statement it requires 
everything within the WHERE construct to be an array of the same dimension of 
the masked expression. For example referring to Figure 3b the WHERE construct 
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IS 


WHERE( distm4.lt .farfd) 

WHERE(distm4.1t. 0.0001) 

ELSEWHERE 

ENDWHERE 

ENDWHERE. 

where distmA is the array used in the mask statements. Since it is a four 
dimensional array all array assignments and evaluations within the expression have 
to be the same dimension of array distm4(24, 12,24,6). 

All calculations needed to produce the desired results of subroutine VWS.f are 
placed in the parallel routines before the WHERE construct. The parallel version 
of VELWING.f drastically out-performs the original serial FORTRAN code. The 
results are discussed in Chapter 4. 

3.3. 1.4 Attributes of the Parallel Matrix System Solver 

The CMSSL is a rapidly growing set of numerical routines that support com- 
putational applications while exploiting the massive parallelism of the Connection 
Machine system [10]. It includes dense and sparse matrix operations; routines for 
solving dense, banded, and sparse linear systems; eigensystem analysis routines[10]. 
Efficient use of the Connection Machine architecture is accomplished through a 
careful choice of data layout, efficient implementation of interprocessor data mo- 
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tion, and careful management of the local memory hierarchy and data paths in 
each processor [10]. 

The serial Gauss elimination routine is replaced by a call of subroutine Gauss 
elimination(LU decomposition) with partial pivoting. The system to be solved 
consists of dense CM array [A] and CM array {B}- The CMSSL Gauss elimination 
routine permits the change of the blocking factor. The result of the performance 
as it relates to the change in the blocking factor is investigated and the results are 
presented in Chapter 4. 

3. 3. 1.5 Parallelization of Post-Processing Calculations 

Subroutine PRESS. f is used to calculate the pressure distribution over the 
surface of the wing and it is listed in Figure 2a. The parallel version is listed in 
Figure 2b. It is partially parallelized and this degrades the performance. Most of 
the subroutine is written in parallel, but most of the parallel computations occur 
inside of nested DO-LOOPS which are not parallel. This effects the performance 
immensely. 

3.3.2 Parallelization of Compressible Flow Case 

The serial FORTR AN program is based on the field panel method. The method 
is modified from the standard panel method so that the compressibility is accounted 
for by using an iterative scheme in which the density, and therefore G is updated 
during each iteration. 

Subroutine GVICAL.f calculates the volume integrals, which is listed in Figure 
3a. It consists of six nested DO-LOOPS. The inner DO-LOOPS, 11, 12, 13, each 
has conditional IF statements used to determine values for variables X, Y, Z. The 
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inner DO-LOOPS does the Near-field calculations. The outer nested three DO- 
LOOPS 6,7,8 are used to fix a receiver point at the lower right downstream corner 
of the control volume V. Also the surface and field velocities are saved for usage 
throughout the code. 

GVICAL3.fcm is a fully parallelized version of GVICAL.f. It is listed in 
figure 3b. All calculation done in the DO-LOOPS are converted to FORALL 
statements or parallel array assignments. The IF statements assigning values to 
X, Y, Z, within the DO-LOOPS 11, 12, and 13 are replaced by array assignments 
xl00(20, 16, 18), *101(20, 16, 18), and 

2/100(20, 16, 18), 2/101(20, 16, 18), and 2100(20, 16, 18), 2101(20, 16, 18). 

Other calculations within DO-LOOPS 11, 12, and 13 of GVICAL.f are replaced 
by parallel array assignment statements. The IF statement using the masks 

abs(xff), abs(yff) and 

abs(zf f ) is replaced by nested WHERE statements with arrays 

*//100, J///100, and 2//100 replacing 

xff,yff , and zff respectively. 

Scalars xs, ys , and zs in subroutine GVICAL.f are replaced by arrays 

£$100(20, 16, 18), ysl00(20, 16, 18) and 

2$100(20, 16, 18), respectively, which are all formulated by using the FORALL 
command. Although the code is expanded for 82 lines of FORTRAN code to 388 
lines of CM FORTRAN code, the parallel version is the best version for application 
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on MPP computers. Also it out-performs the original serial code. 

Subroutine VELFDG.f listed in Figure 4a takes the summation of all of the 
integrals evaluated by subroutine GVICAL.f. Subroutine VELFDG.f is converted to 
the fully parallel version VELFDG.fcm, which is listed in Figure 4b. This subroutine 
accounts from the effect of compressibility. DO-LOOPS 6, 7, and 8 of subroutine 
VELFDG.f are replaced by FORALL statements in subroutine VELFDG.fcm. 
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CHAPTER 4 


PERFORMANCE ANALYSIS 

Studying each phase of the CM-FORTRAN version of the source field panel 
method shows that the CM-5 has its advantages and disadvantages. The serial 
and parallel codes for the standard panel method and the field-panel method are 
executed on Cray-YMP with a single processor and CM-5 with 32, 64 and 128 
nodes, respectively. The performance on Cray-YMP supercomputer with a single 
processor provides the basis for comparison. The computational performance is 
obtained using Cray-YMP’s PERFTRACE utility. The CPU execution (CM-busy) 
time on the CM-5 is obtained via CM TIMER routines. 

Identical pressure distribution results are obtained using both serial FORTRAN 
code and parallel CM-FORTRAN code for the incompressible flow case. This part 
of results is published by Logan and Hu [7]. Results of the parallel code for the 
compressible flow case are different from the results accrued on the Cray-YMP. 
Performance of the code using 1 iteration and 6 iterations is presented for the 
compressible flow case. 

4.1 Performance of Incompressible Flow Computation 

Results relating to how CM-5 performance compares to the Cray-YMP are 
presented in Tables 1 through 4. Graphs 1 thru 4 are developed from results listed 
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in these corresponding tables. Partially parallelized code as opposed to fully paral- 
lelized code has a significant effect on performance. This effect and other pertinent 
details are presented below. 

Tables 1 thru 3 gives CPU time required for each step of the incompressible 
flow source panel method as mentioned. It lists results obtained for the smaller 
problem where N = 24x12, 288 source panels and the larger problem where N = 
48x24, 1152 source panels. 

Table 1 lists CPU time used to construct the matrices. This segment of code is 
fully parallelized. It is found that CM-5 performs better than Cray-YMP on both 
smaller and larger problems. Scaling the CM-5 up to 64 nodes and then to 128 
nodes results in even faster CPU time. Comparing the Cray-YMP to the CM-5 as 
the problem size increases, it is seen that the Cray-YMP requires approximately 
twenty times as much CPU time to complete the larger problem than it requires to 
complete the smaller problem. However, CM-5 requires at most ten times as much 
CPU time to complete the larger problem as it requires to complete the smaller 
one. Since the CPU time decreases considerably as the number of nodes increases, 
this implies that the code is not only fully parallelized but it is also efficiently 
implemented on the parallel processing unit. 

Table 2 lists CPU time used to solve the matrices. Gaussian elimination is used 
to solve the matrices on the conventional Cray-YMP. A CM-FORTRAN Gaussian 
elimination library routine is used to solve the matrices on the CM-5. Cray-YMP 
performs faster than CM-5 for the smaller problem requiring less than half the time 
it takes CM-5 to solve the problem using 128 nodes. However, CM-5 performs 
faster than Cray-YMP for the larger problem. Also as the problem size increases 


27 



Cray-YMP requires fifty-eight times as much CPU time to solve the larger problem 
than it does to solve the smaller problem. However, CM-5 requires at most six 
times as much CPU time to solve the larger problem than it does to solve the 
smaller problem. CPU time does not consistly decrease as the number of nodes 
used increases. This may be due to CM-5 communication overhead. 

Table 3 lists CPU time used for post-processing. This part of the 
CM-FORTRAN code is partially parallelized. Partially parallelizing the code has 
a significant effect on the performance of CM-5. Thus, Cray-YMP performs faster 
than CM-5 for the smaller and larger problems. It completes the task in approx- 
imately one-tenth the time it takes CM-5 to complete the task. However, as the 
problem size increases Cray-YMP requires approximately seventeen times as much 
CPU time to solve the larger problem than it does to solve the smaller problem. 
CM-5 requires close to four and one-half times as much CPU time to solve the larger 
problem than it does to complete the smaller problem. 

Table 4 lists total CPU time required to solve the incompressible flow part via 
the source panel method. Cray-YMP performs faster than CM-5 on the smaller 
problem. It requires close to one-third times the CPU time it takes CM-5 to solve 
the smaller problem. CM-5 performance may be attributed to the fact that some 
of the code is partially parallelized. Its performance may also be attributed to 
communication overhead. However, CM-5 outperforms Cray-YMP on the larger 
problem. Cray-YMP requires thirty times as much CPU time to solve the larger 
problem than the smaller problem, whereas CM-5 only requires five and one-third 
times as much CPU time to solve the larger problem than the smaller problem. 
Overall, CM-5 outperforms Cray-YMP in implementing the source panel method 
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on incompressible flow problems. 


4.2 Performance of Compressible Flow Computations 

Results for the compressible flow case obtained from the field panel method 
are obtained. The computational results are not identical to the results obtained 
on the Cray-YMP, and some investigations are being made. The performance of 
the CM-5 CM-FORTRAN version of the code is compared with the Cray-YMP 
serial FORTRAN version. The code is executed using 1 iteration and again using 
6 iterations. The problem sizes increases from N = 24 x 12, 288 source panels to N 
= 48x24, 1152 source panels. The results are listed in Tables 5 and 6. Graphs 5 
and 6 re-represents results listed in Table 6. 

Table 5 lists CPU time for evaluating volume integrals. CM-5 outperforms 
Cray-YMP in completing this task. The CM-5 subroutine that completes this task 
is fully parallelized. Also as the number of nodes increases CPU time consistently 
decreases by a considerable amount. The performance of Cray-YMP diminishes on 
the larger problem. However, CM-5 attains the same CPU time it attains for the 
larger problem as it does for the smaller problem. Overall CM-5 outperforms Cray- 
YMP on both problems. More importantly, it does not require additional CPU 
time to solve the larger problem like Cray-YMP does. CM-5 good performance 
can be attributed to the fact that the code is fully parallelized and it is efficiently 
implemented on the parallel processing unit. 

Table 6 lists total CPU time for 6 iterations. Cray-YMP performs better than 
CM-5 on both the smaller and larger problem. This can be attributed to the fact 
that the CM-5 version uses subroutines that are partially parallelized. This causes 
communication overhead to be very high. However, it is noticed that as the problem 
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size increases Cray-YMP, CPU time increases by a factor of four and four-fifths, 
whereas CM- 5 CPU time only increases by a factor of one and one- third. This 
implies that CM-5 may be the best choice for large scale problems, since CPU time 
increases by a small margin compared to Cray-YMP. 
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CHAPTER 5 


CONCLUDING REMARKS 


The study of an integral equation method for a three dimensional aerodynamics 
problem is made in a massively parallel processing (MPP) environment. Serial 
FORTRAN code is converted into parallel CM-FORTRAN code and a comparative 
study is made to determine how well the CM-5 MPP computer performs when 
compared to the conventional Cray-YMP supercomputer. Since CM-5 is intended 
for use on large scale problems, two different problem sizes are tested to see how 
the size of the problem effects the performance. 

From this investigation, the following concluding remarks can be made: (1) 
Whenever the parallel code is fully parallelized, CM-5 out-performs Cray-YMP; (2) 
For the small problem, a partially parallel code is not approriate for CM-5; and (3) 
When the size of the problem increases, the performance of CM-5 increases. For 
further study, it is suggested that code should be fully parallelized in all subroutines 
in order to achieve an overall high performance in a MPP environment. 
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SUBROUTINE VELWING ( IVELCT , IWG , IG, JG, KG) 
COMMON/BLKOl/X (25,13) , Y (25 , 13 ) , Z (25, 13) 


DO 1 JS=1 , NC 
JS1=JS+1 
DO 2 IS=1 , NR 
IS1=IS+1 
X1=X(IS, JS) 


XC= (X1+X2+X3+X4 ) /4 . 0 


DO 11 JR=l,NC/2 

JR1=JR+1 

DO 12 IR=1 , NR 


XF=0 . 2 5* ( X ( IR , JR1 ) +X ( IR1 , JR) +X ( IR , JR) +X ( IR1 , JR1 ) ) 


DX=XF-XC 


DIST=SQRT (DX*DX+DY*DY+DZ*DZ) 

IF (DIST. LT. FARFD) THEN 
IF(DIST.LT. 0.0001) THEN 
UC=0.5*UNX(IS, JS) 

VC=0 . 5*UNY ( IS , JS) 

WC=0 . 5*UNZ (IS, JS) 

ELSE 

CALL VWS (XI , X2 , Z1 , Z3 , Y1 , XF, YF, ZF,IS,JS, UC, VC , WC) 
END IF 
ELSE 

AREAXZ=ABS ( (X2-X1) * (Z3-Z1) ) 

XYN=UNX(IS, JS) /UNY(IS, JS) 

ZYN=UNZ (IS, JS) /UNY(IS, JS) 

FACXZS=SQRT ( 1 . 0+XYN*XYN+ZYN*ZYN) 

AREAS=FACXZ S * AREAX Z 

CONSTFF=OPI4 *AREAS / (DIST*DIST*DIST) 

UC=CONSTFF*DX 
VC=CONSTFF*DY 
WC=CONSTFF*DZ 
END IF 


NBA= ( JS-1) *NR+IS 

A (KEQ, NBA) =A (KEQ, NBA) - (UC*UNX (IR, JR) 

+VC*UNY ( IR, JR) 

+WC*UNZ ( IR, JR) ) 

12 CONTINUE 
11 CONTINUE 
2 CONTINUE 
1 CONTINUE 
RETURN 
END 

Figure la. Serial Version - Constructing [A]. 
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SUBROUTINE velwing ( ivelct , iwg, ig, jg,kg) 
include ' /usr/cm/include/cm/CMF_defs.h' 
COMMON/ BLK01/X (25,13) , Y ( 25 , 13 ) , Z (25 , 13 ) 


real unxm3 (24,12,24) , vnxm3 (24,12,24) , wnxm3 (24,12,24) 


unxm3 = spread(unx( :nr, :nc) ,dim=3,ncopies=24) 


xcm3 = spread (xcm, dim=3 , ncopies=2 4) 


xcm4 =spread ( xcm3 , dim=4 , ncopies=6 ) 


xfm3 = spread (xfm,dim=l,ncopies=12) 


xfm4 = spread(xfm3 ,dim=l,ncopies=24) 


dxm4 = xfm4 -xcm4 


distm4 = sqrt (dxm4*dxm4+dym4*dym4+dzm4*dzin4) 


where (distm4 . It . farfd) 

where (distm4 .LT. 0.0001) 
ucm4 =0.5* unxm4 


elsewhere 

xynm4 = unxm4/vnxm4 
zynm4 = wnxm4/vnxm4 

facxzsm4 = sqrt (1.0 + xynm4 * xynm4 + zynm4 * zynm4) 
ddxm4 = ddxm4 
ddzm4 = ddzm4 

fac4 = opi4*facxzsm4*abs ( ddxm4 *ddzm4 ) /6. 0 

vwx4 = f fxll+ffx21+f fx31+f fx41+ 

& + ffx44+ffxl5+ffx25+ffx35+ffx45 


ucm4 = vwx4*fac4 


endwhere 

elsewhere 

zynni4 = wnxm4/vnxm4 

facxzsm4 = sqrt ( l+xynm4*xynm4 + zynm4*zynm4) 
contm4 = opi4*areasm4/ (distm4*distm4*distm4) 
vcm4 = contm4 *dym4 
endwhere 

forall ( jr=l : nc/2 , ir=l: nr , js=l: nc, is=l:nr) 

& a(ir+( jr-1) *nr, is+( js-1) *nr) = a (ir+ ( jr-1) *nr , is+ ( js-1) *nr) 
& - ( (ucm4 (is, js, ir, jr) 

& *unxd4 (is, js, ir, jr) ) + 

& (vcm4 ( is, js, ir , jr ) *vnxd4 (is, js, ir , jr) ) +(wcm4 (is, js, ir , jr) 

& *wnxd4 (is, js, ir, jr) ) ) 
return 
end 


Figure lb. CM Parallel Version - Constructing [A] . 
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SUBROUTINE PRESS ( ITER, ICS , INC) 
DIMENSION CPU (24,12,40) 

COMMON / BLKO 1/X(25,13) , Y ( 25 , 13 ) , Z (25 , 13 ) 


DO 1 JR=1 , NC/2 
JR1=JR+1 

DO 2 IR= (IUP-1) *NR/2+l , IUP*NR/ 2 
IR1=IR+1 

XF=0 . 2 5* ( X ( IR , JR1 ) +X ( IR1 , JR) +X ( IR, JR) +X ( IR1 , JR1 ) ) 


CALL VELWING( 1,2, 1,1,1) 
OMTRX=BETAR* ( ZF-ZP) -ALFAR* (YF-YP) 


UTRMS=EOX*HAM+OMTRX 


IF ( INC . EQ . 0 ) THEN 
UR=VCX-UTRMS 


UVWR2=UR*UR+VR*VR+WR*WR 
CPU (IR, JR, ITER) =1 . 0 -UVWR2 / HAM2 
ELSE 

CALL VELFDG ( ITER , I CALREC ,IR,JR,1,XF,YF,ZF, 
& UTT , VTT , WTT ) 

UR=VCX+UTT-UTRMS 


UVWR2 =UR * UR+VR* VR+WR *WR 

UVWTRM2 =UTRMS *UTRMS +VTRMS * VTRMS +WTRMS *WTRMS 


RHO= ( 1 . 0+0 . 5*GM1* ( -UVWR2+UVWTRM2-2 . 0*DPHI JR) ) 
+ ** (1. 0/GM1) 

CPU(IR, JR, ITER) =2 . 0/GAMA/HAM2* (RHO**GAMA-l . 0) 
END IF 

UVWR=SQRT ( UVWR2 ) 

HAML=UVWR/(RHO**(0.5*GM1) ) 

END IF 
2 CONTINUE 
1 CONTINUE 
RETURN 
END 


Figure 2a. Serial Version Post-processing. 
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SUBROUTINE press ( iter , ics , inc) 
DIMENSION CPU (24, 12,40) 
integer aaa , bbb , ccc 

COMMON / BLKO 1/X(25, 13) , Y ( 25 , 13 ) , Z (25 , 13 ) 


do jr = 1 , nc / 2 
jrl = jr + 1 
do ir = nr/2+i,nr 
in = ir + 1 

xf = 0.25 * (x(ir,jrl) + x(irl,jr) + x(ir,jr) + 
& x (irl , jrl) ) 


call velwing2 (ccc, bbb, ccc, ccc, ccc) 

omtrx = betar * (zf - zp) - alfar * (yf - yp) 


utrms = eOx * ham + omtrx 


if (inc.eq.O) then 
ur = vex - utrms 


uvwr2 = ur * ur + vr * vr + wr * wr 
cpu(ir, jr, iter) = 1.0 - uvwr2 / ham2 
else 

call velfdg(iter , icalrec, ir , jr, l,xf ,yf , zf , 
& utt,vtt,wtt) 

ur = vex + utt - utrms 


uvwr2 = ur * ur + vr * vr + wr * wr 
uvwtrm2 = utrms * utrms + vtrms * vtrms + 

& wtrms * wtrms 

rho = (1.0 + 0.5 * gml * (-uvwr2 + uvwtrm2 - 2.0 
& * dphijr)) ** (1.0 / gml) 

epu ( ir , jr , iter) = 2.0 / gama / ham2 * (rho ** gama 

& - 1 . 0 ) 

end if 

IF (.NOT. inc .EQ. 0) THEN 
uvwr = sqrt(uvwr2) 

haml = uvwr / (rho ** (0.5 * gml)) 

END IF 
ENDDO 
ENDDO 
RETURN 
END 


Figure 2b. CM Parallel Version - Post-processing. 
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SUBROUTINE GVICAL (IFORS , XRE, YRE, ZRE) 

COMMON/ VELPG/UPG (20,16,18), VPG (20,16,18), WPG (20,16,18) 


IF (IFORS. EQ.O) THEN 
XRE=XA-0. 5*DX 


END IF 
DO 6 J=1 , NY 
DO 7 K=1 , NZ 
DO 8 1=1, NX 

XS=XB+ ( FLOAT ( I ) -0 . 5 ) *DX 


XFF=XRE-XS 


DIST=SQRT(XFF*XFF+YFF*YFF+ZFF*ZFF) 

IF (DIST . GE . FARFD) GOTO 10 

IF (ABS (XFF) .LT. (0.5*DX) . AND. ABS(YFF) .LT. (0.5*DY) 
+. AND. ABS ( ZFF) . LT. ( 0 . 5*DZ) ) THEN 
PDX=0 . 51*DX 
ELSE 

PDX=0 . 0 
END IF 
DO 11 IS=1 , 2 


DO 12 JS=1 , 2 


DO 13 KS=1 , 2 


X2=X*X 


SQU=SQRT ( Y2/X2 ) 


ATYZ=ATAN2 (Z*SQU,RS) 


UU(IS, JS,KS) =Y/SQU*ATYZ-0. 5* ( Y*RSZ+Z*RSY) 


13 CONTINUE 
12 CONTINUE 
11 CONTINUE 

UT=- ( UU (2,2,2) +UU (2,1,1) +UU (1,2,1) +UU (1,1,2) 

+ -UU (2,2,1) -UU (2,1,2) -UU (1,2,2) -UU (1,1,1)) *OPI4 


GOTO 14 

10 VD3=VOL/ (DIST*DIST*DIST) 
UT=OPI4*XFF*VD3 


8 CONTINUE 
7 CONTINUE 
6 CONTINUE 
RETURN 
END 

Figure 3a. Serial Version - Calculating Volume Integrals. 
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SUBROUTINE gvical3 ( if ors , xre, yre, zre) 

COMMON /VELPG/UPG (20, 16, 18) , VPG(20, 16, 18) , WPG (20 , 16 , 18 ) 


real utl (20, 16, 18) , vtl (20 , 16 , 18) ,wtl (20, 16, 18) 


save 

if (ifors .EQ. 0) then 
xre = xa - 0 . 5 * dx 


end if 


zff 100 ( :nx, :ny, :nz) = zre-zslOO ( :nx, :ny, :nz) 

dist = sqrt (xf f 100*xf f 100+yf f 100*yf f 100+zf f 100*zf f 100) 
where(dist. It. farfd) 

where (abs(xffl00( :nx, :ny, :nz) ) .It. (0.5*dx) ) 
where (abs(yffl00( :nx, :ny, :nz) ) .It. (0.5*dy) ) 
where (abs(zffl00( :nx, :ny, :nz) ) .It. (0.5*dz) ) 
pdxlOO ( : nx, :ny, :nz) = 0.51*dx 
elsewhere 

pdxl00( :nx, :ny, :nz) = 0.0 


endwhere 

xl00( :nx, :ny, :nz) = xre - pdxlOO ( :nx, :ny, :nz) 
&- (xslOO ( : nx, :ny, :nz)-0.5*dx) 


squill ( :nx, :ny , :nz) =sqrt ( (ylOO ( :nx, :ny, :nz) * 
& ylOO ( : nx, : ny , : nz) ) 

&/ (xlOO ( :nx, :ny , :nz) *xl00 ( :nx, :ny, :nz) ) ) 


rslll ( : nx, : ny , : nz) =sqrt (xlOO ( :nx, :ny , :nz) * 
& xlOO ( : nx, : ny , : nz) 


atyzlll ( :nx, :ny , :nz) =atan2 (zlOO ( :nx, :ny, :nz) * 
& squlll( :nx, :ny, :nz) , 


&rslll( :nx, :ny, :nz) ) 


uulll( :nx, :ny, :nz)=yl00( :nx, :ny, :nz) / 

& squill (: nx, : ny, : nz) 

&*atyzlll ( :nx, :ny , :nz) -0. 5* (ylOO ( :nx, :ny, :nz) * 


where (dist( :nx, :ny, :nz) .It. farfd) 
utl ( : nx, : ny , :nz) =-(uu222 ( :nx, :ny, :nz) +uu211 ( :nx, :ny, :nz) 


elsewhere 

vd3 ( :nx, :ny, :nz) = vol/ (dist ( :nx, :ny, :nz) *dist ( :nx, :ny, :nz) * 
& dist ( : nx, : ny , : nz) ) 
endwhere 


RETURN 

END 

Figure 3b CM Parallel Version - Calculating Volume Integrals. 
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SUBROUTINE VELFDG ( ITER , ICALREC , IR, JR, KR , XFP , YFP , ZFP , 
& UTT , VTT , WTT ) 

COMMON /BLK01/XW (25,13) , YW ( 25 , 13 ) , ZW (25 , 13 ) 


IF ( ICALREC . EQ . 1 ) THEN 
CALL GVICAL(1,XFP, YFP, ZFP) 


DO 7 JS=1 , NY 
DO 7 KS=1 , NZ 
DO 7 IS=1 , NX 

UTT=UTT+SUPG ( IS , JS , KS) *G(IS, JS,KS) *RATIO 


7 CONTINUE 
END IF 

IF ( ICALREC . EQ . 2 ) THEN 


DO 6 JS=1 , NY 
DO 6 KS=1 , NZ 
DO 6 IS=1 , NX 

UTT=UTT+SU*G ( IS , JS , KS) *RATIO 


6 CONTINUE 
END IF 

IF (ICALREC. EQ.O) THEN 


ID=NX-IR 


DO 8 JS=1 , NY 
DO 8 KS=1 , NZ 
DO 8 IS=1 , NX 
IF(IS.LE.IR) THEN 
I=IS+ID 
SIGI=1 . 0 
ELSE 

IRSD=IS-IR 
I=IR-IRSD+ID 
SIGI=-1 . 0 
END IF 


UTT=UTT+SIGI*UPG ( I , J , K) *G(IS, JS,KS) * RATIO 


8 CONTINUE 
END IF 
RETURN 
END 


Figure 4a. Serial Version - Summing the Volume Integrals. 
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SUBROUTINE velfdg(iter, icalrec, ir , jr ,kr , xfp, yfp, zfp, utt , 
& vtt,wtt) 

COMMON / BLKO 1 / XW (25,13) , YW ( 25 , 13 ) , ZW (25 , 13 ) 


real dis31(ny,nz) ,dsl(ny,nz) 


IF (icalrec .EQ. 1) THEN 
CALL gvical3 (l,xfp,yfp, zfp) 


utt = utt + sum(supg( :nx, :ny, :nz) * g( :nx, :ny, :nz) ) 


END IF 

IF (icalrec .EQ. 2) THEN 


utt = utt+ sum (sulOO ( : nx, : ny , : nz) *g ( : nx, : ny , : nz) ) 


END IF 

IF (icalrec .EQ. 0) THEN 


forall (is=l: ir , js=l:ny,ks=l:nz) 

& im2 (is, js,ks) = is+idm2 (is, js,ks) 
forall ( is=l : ir , js=l : ny , ks=l : nz ) 

& sigi(is, js,ks)=1.0 

forall (is=ir+l:nx, js=l:ny,ks=l:nz) 

& im2 (is, js,ks) =ir-is+ir+idm2 (is, js,ks) 
forall ( is=ir+l : nx, j s=l : ny , ks=l : nz ) 

& sigi (is, js,ks) = -1.0 


forall (is=l:nx, js=l:ny,ks=l:kr) 

& km2(is, js,ks)= ks+kdm2 (is, js,ks) 
forall ( is=l:nx, js=l: ny ,ks=l:kr) 

& sigk(is, js,ks)= 1.0 

forall ( is=l : nx , j s=l : ny , ks=kr+l : nz ) 

& km2 (is, js,ks) = kr-ks+kr+kdm2 (is, js,ks) 
forall ( is=l : nx , j s=l : ny , ks=kr+l : nz ) 

& sigk(is, js,ks) = -1.0 

forall (is=l:nx, js=l:ny,ks=l:nz) 

& upg3 (is, js,ks)= upg(im2 (is, js,ks) , jm2 (is, js,ks) , 

& km2 (is, js,ks) ) 

forall (is=l:nx, js=l:ny,ks=l:nz) 

& vpg3 (is, js,ks)=vpg(im2 (is, js,ks) , jm2 (is, js,ks) , 

& km2 (is, js,ks) ) 

forall ( is=l : nx , j s=l : ny , ks=l : nz) 

& wpg3 (is, js,ks) = wpg (im2 ( is, js,ks) , jm2 (is, js,ks) , 

& km2 (is, js,ks) ) 

utt=utt+sum(sigi ( :nx, : ny , : nz) *upg3 ( :nx, :ny , : nz) * 

& g ( : nx, : ny , : nz) ) 

• • • • 

ENDIF 

RETURN 

END 

Figure 4b. CM Parallel Version-Summing the Volume Integrals. 
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Table 1 . CPU time in seconds for constructing matrices. 


size, N= 



Cray-YMP 

0.44 

8.25 


U.24 

2.34 

64-node CM5 

0.17 

1.20 

1 28-node CM5 

0.12 

0.65 


Table 2. CPU time in seconds for Gaussian elimination. 


size, N= 

288(=24x12) 


Cray-YMP 

0.58 

33.85 


1.43 

8 57 

64-node CM5 

2.11 

7.05 

1 28-node CM5 

1.65 

7.66 i 
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Table 3. CPU time in seconds for post processing. 


size, N= 



Cray-YMP 

0.24 

4.30 


2.18 

9.45 

64-node CM5 

2.21 

9.61 

1 28-node CM5 

2.18 

9.61 


Table 4. Total CPU time in seconds for incompressible flow. 


size, N= 

288 (=24x1 2) 


Cray-YMP 

1.33 

46.60 


3.86 

2055 

64-node CM5 

4.53 

18.01 

1 28-node CM5 

4.00 

18.08 
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Table 5. CPU time in seconds for evaluating volume integrals. 


size, N= 



Cray-YMP 

1.48 

5.60 


0.11 

- 

64-node CM5 

0.07 

0.07 

1 28-node CM5 

0.04 

0.04 


Table 6. Total CPU time in seconds for 6 iterations. 


size, N= 

288(=24x1 2) 


Cray-YMP 

86.0 

412.0 


1416.0 

- 

64-node CM5 

1300.0 

1634.0 

1 28-node CM5 

1202.0 

1396.0 
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No. of Blocks 


Graph 1. Effect of number of blocks in Gaussian elimination, N=24xl2. 



Graph 2. Effect of number of blocks in Gaussian elimination, N=48x24. 
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Graph 3. CPU time for constructing matrices, N=48x24. 
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Graph 4. CPU time for evaluating volume integrals, N=24x1 2. 
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Graph 5. CPU time for compressible flows with 6 iterations, N=24x1 2. 
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Graph 6. CPU time for compressible flows with 6 iterations, N=48x24. 
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ABSTRACT 


PERFORMANCE ANALYSIS OF THREE DIMENSIONAL 
INTEGRAL EQUATION COMPUTATIONS ON 
A MASSIVELY PARALLEL COMPUTER 

Student: Terry G. Logan, Department: Mathematics 
Degree: Master of Science, August 1994 

The purpose of this study is to investigate the performance of the integral 
equation computations using numerical source field-panel method in a massively 
parallel processing (MPP) environment. A comparative study of computational 
performance of the MPP CM-5 computer and conventional Cray-YMP supercom- 
puter for a three-dimensional flow problem is made. A serial FORTRAN code is 
converted into a parallel CM-FORTRAN code. Some performance results are ob- 
tained on CM-5 with 32, 64, 128 nodes along with those on Cray-YMP with a 
single processor. The comparison of the performance indicates that the parallel 
CM-FORTRAN code near or out-performs the equivalent serial FORTRAN code 
for some cases. 



